Preprocessing

Figures

QC Metrics

Doublet Detection Summary

Summary of Detected Doublets by Sample
sample total_cells predicted_doublets percent_doublets
GSM5467406_HB17_background_filtered_feature_bc_matrix 1688 51 3.02
GSM5467407_HB17_tumor_filtered_feature_bc_matrix 7994 240 3.00
GSM5467408_HB17_PDX_filtered_feature_bc_matrix 8024 241 3.00
GSM5467409_HB30_tumor_filtered_feature_bc_matrix 18713 561 3.00
GSM5467410_HB30_PDX_filtered_feature_bc_matrix 10100 303 3.00
GSM5467411_HB53_background_filtered_feature_bc_matrix 8540 256 3.00
GSM5467412_HB53_tumor_filtered_feature_bc_matrix 12792 384 3.00

Pre/Post Filtering Summary

Number of Cells and Genes Before and After Filtering
sample total_cells avg_nFeature_RNA avg_nCount_RNA avg_percent_mt
GSM5467406_HB17_background_filtered_feature_bc_matrix 1688 1560 3740 6.84
GSM5467407_HB17_tumor_filtered_feature_bc_matrix 7994 3043 6730 0.78
GSM5467408_HB17_PDX_filtered_feature_bc_matrix 8024 3135 6201 0.46
GSM5467409_HB30_tumor_filtered_feature_bc_matrix 18713 2235 4431 0.54
GSM5467410_HB30_PDX_filtered_feature_bc_matrix 10100 2572 4844 0.50
GSM5467411_HB53_background_filtered_feature_bc_matrix 8540 3071 9455 0.19
GSM5467412_HB53_tumor_filtered_feature_bc_matrix 12792 3045 7679 0.96

Discussion: We applied filtering thresholds of >500 genes per cell, >800 total counts, and <10% mitochondrial content to remove low-quality or stressed cells, based on QC metric distributions and violin plots. Additionally, 2,036 predicted doublets (~3% of the dataset) were excluded to avoid confounding downstream analysis. This resulted in a high-quality dataset of 67,851 single cells across seven samples. In the literature, objective strategies such as median absolute deviation (MAD), EmptyDrops, and Gaussian mixture models have been proposed to set thresholds without relying on visual inspection. These methods improve reproducibility and are especially useful for large-scale single-cell workflows.

Normalization Procedure

Filtering Thresholds and Rationale

To ensure high-quality cells for downstream analysis, we applied the following filtering thresholds: - Minimum genes per cell: 200 - Maximum genes per cell: 6,000 - Maximum percent mitochondrial reads: 10% - Doublet removal: Cells identified by the doublet detection algorithm (predicted doublets) were excluded.

To ensure high-quality single-cell data, we filtered cells based on common thresholds: >200 genes, <6,000 genes, <10% mitochondrial content, and removed predicted doublets (~3%). This reduced noise from low-quality or multiplet cells, leaving 67,851 high-confidence cells across all samples. These thresholds were selected based on distribution plots but are also supported by objective methods like MAD, EmptyDrops, and Gaussian mixture models. We normalized the data using Seurat’s LogNormalize method to scale and log-transform counts, reducing technical variability and preparing the dataset for downstream analysis.

Feature Selection

Discussion:

We used Seurat’s FindVariableFeatures() function with the “vst” method to identify the top 2,000 highly variable genes based on standardized variance. This approach selects genes that exhibit more variability than expected given their average expression, enhancing the biological signal in downstream analyses. Of the total 33,538 genes, 2,000 were classified as highly variable and used for dimensionality reduction.

PCA

Discussion:

Clustering

Discussion:

We clustered the cells using Seurat’s graph-based clustering approach with a resolution of 0.8, which resulted in 16 distinct clusters. Visualization with UMAP shows well-separated clusters labeled in the umap_clusters.png plot. According to our QC summary, the dataset contains a total of 67,851 cells across seven samples, with sample sizes ranging from 1,688 to 18,713 cells. The UMAP colored by sample origin (umap_by_sample.png) shows clear batch separation between samples. Given the visible sample-specific grouping, we will consider performing data integration to mitigate potential batch effects and improve biological interpretability.

Marker Gene Analysis

Top 5 Marker Genes for Each Cluster
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
0 2.584215 0.645 0.467 0.00e+00 0 DLK1
0 2.522924 0.484 0.369 0.00e+00 0 LYZ
0 2.522325 0.331 0.247 0.00e+00 0 SLC35F1
0 2.519760 0.492 0.390 0.00e+00 0 MT-ND6
0 2.490280 0.784 0.631 0.00e+00 0 STEAP1B
0 3.094699 0.592 0.184 0.00e+00 1 GPC5
0 2.679495 0.815 0.374 0.00e+00 1 BHMT
0 2.650181 0.253 0.050 0.00e+00 1 SLC6A1-AS1
0 2.533652 0.414 0.106 0.00e+00 1 KIF6
0 2.499194 0.395 0.078 0.00e+00 1 CYP3A7
0 1.357827 0.362 0.189 0.00e+00 2 AGBL4
0 1.315979 0.416 0.250 0.00e+00 2 ACSM3
0 1.310377 0.256 0.111 0.00e+00 2 GABRB1
0 1.301737 0.372 0.213 0.00e+00 2 ATP2B2
0 1.289034 0.322 0.160 0.00e+00 2 AC116049.2
0 5.409732 0.305 0.007 0.00e+00 3 AC007221.2
0 5.308867 0.375 0.006 0.00e+00 3 LINC01488
0 5.256976 0.785 0.063 0.00e+00 3 CYP2B6
0 5.242876 0.638 0.019 0.00e+00 3 LINC02197
0 5.202082 0.762 0.026 0.00e+00 3 ADRA1A
0 3.673902 0.602 0.043 0.00e+00 4 KIF18B
0 3.559132 0.397 0.026 0.00e+00 4 C2orf48
0 3.542532 0.333 0.024 0.00e+00 4 PIF1
0 3.418947 0.604 0.059 0.00e+00 4 CDC25C
0 3.379527 0.685 0.069 0.00e+00 4 AC091057.6
0 3.597156 0.457 0.131 0.00e+00 5 MYH7B
0 3.334317 0.358 0.065 0.00e+00 5 EPHA6
0 3.324848 0.453 0.089 0.00e+00 5 SHISA6
0 3.227684 0.290 0.082 0.00e+00 5 LRRC4C
0 3.167399 0.279 0.039 0.00e+00 5 TLL2
0 6.262944 0.859 0.066 0.00e+00 6 PTPRB
0 6.208756 0.884 0.097 0.00e+00 6 ST6GALNAC3
0 6.196948 0.394 0.013 0.00e+00 6 NOTCH4
0 6.158052 0.284 0.012 0.00e+00 6 CLEC4M
0 6.136256 0.396 0.012 0.00e+00 6 BTNL9
0 6.583390 0.406 0.017 0.00e+00 7 IGSF21
0 6.271785 0.458 0.015 0.00e+00 7 FGD2
0 6.107157 0.480 0.022 0.00e+00 7 ITGAX
0 5.922977 0.357 0.030 0.00e+00 7 AC100849.1
0 5.909372 0.446 0.023 0.00e+00 7 ADGRE2
0 7.571497 0.334 0.009 0.00e+00 8 PLA2G5
0 7.551198 0.399 0.008 0.00e+00 8 PRR16
0 7.249853 0.424 0.017 0.00e+00 8 CCBE1
0 7.162164 0.282 0.008 0.00e+00 8 CDH19
0 7.082625 0.319 0.006 0.00e+00 8 PTH1R
0 3.180570 0.369 0.169 0.00e+00 9 BICC1
0 2.366102 0.371 0.201 0.00e+00 9 FAM13A
0 2.356580 0.309 0.114 0.00e+00 9 FAM160A1
0 2.311444 0.625 0.284 0.00e+00 9 PRAG1
0 2.211793 0.304 0.118 0.00e+00 9 TUFT1
0 7.888640 0.651 0.020 0.00e+00 10 CD247
0 7.324957 0.322 0.010 0.00e+00 10 BCL11B
0 7.159468 0.312 0.008 0.00e+00 10 ITK
0 6.883561 0.290 0.012 0.00e+00 10 PYHIN1
0 6.845137 0.451 0.020 0.00e+00 10 SLFN12L
0 2.224599 0.408 0.075 0.00e+00 11 CD163L1
0 2.042150 0.261 0.025 0.00e+00 11 FPR3
0 2.028633 0.521 0.069 0.00e+00 11 FYB1
0 2.022975 0.309 0.036 0.00e+00 11 TFEC
0 1.995229 0.377 0.037 0.00e+00 11 SYK
0 1.443023 0.299 0.165 0.00e+00 12 CST4
0 1.229287 0.554 0.339 0.00e+00 12 HS3ST5
0 1.220649 0.531 0.304 0.00e+00 12 AC073050.1
0 1.187433 0.299 0.229 2.27e-05 12 PDZRN3
0 1.177465 0.397 0.279 0.00e+00 12 PIP5K1B
0 3.881955 0.435 0.041 0.00e+00 13 SNTG2
0 3.684900 0.374 0.027 0.00e+00 13 VWA2
0 3.600627 0.735 0.100 0.00e+00 13 NTM
0 3.558258 0.380 0.029 0.00e+00 13 ASB18
0 3.503418 0.406 0.030 0.00e+00 13 BRINP1
0 4.211632 0.387 0.069 0.00e+00 14 TAT
0 3.419895 0.384 0.094 0.00e+00 14 HRG
0 3.313934 0.274 0.075 0.00e+00 14 IL1RN
0 3.248064 0.630 0.148 0.00e+00 14 SAA1
0 3.232984 0.349 0.107 0.00e+00 14 HSD17B6
0 2.915517 0.365 0.032 0.00e+00 15 AL591686.2
0 2.760190 0.414 0.043 0.00e+00 15 SNTG2
0 2.694599 0.353 0.029 0.00e+00 15 CCDC148-AS1
0 2.397245 0.916 0.177 0.00e+00 15 TMEFF2
0 2.389508 0.422 0.058 0.00e+00 15 LINC00278
0 5.999050 0.257 0.037 0.00e+00 16 FCN3
0 4.772774 0.303 0.056 0.00e+00 16 STAB1
0 4.190372 0.407 0.063 0.00e+00 16 EGFL7
0 4.133417 0.257 0.084 0.00e+00 16 CD4
0 4.023443 0.270 0.071 0.00e+00 16 IGFBP3
0 10.211202 0.756 0.004 0.00e+00 17 SPTA1
0 9.702469 0.594 0.002 0.00e+00 17 EPB42
0 9.541239 0.650 0.005 0.00e+00 17 RHAG
0 9.413369 0.500 0.002 0.00e+00 17 GYPB
0 9.213602 0.481 0.004 0.00e+00 17 GYPE
0 8.424154 0.696 0.030 0.00e+00 18 RIMS2
0 8.343843 0.412 0.010 0.00e+00 18 LINC01811
0 8.261078 0.304 0.003 0.00e+00 18 SLC18A1
0 8.212629 0.662 0.029 0.00e+00 18 KCNB2
0 8.075693 0.716 0.018 0.00e+00 18 RIMBP2

Discussion:

We performed marker gene analysis using the Wilcoxon rank sum test, as implemented in Seurat’s FindAllMarkers() function. This method identifies genes that are significantly more highly expressed in each cluster relative to all others.

Manual Labeling & Plots

Discussion

We used the Wilcoxon rank sum test to identify marker genes for each cluster. This test is commonly used in single-cell RNA-seq analysis due to its robustness to non-normal distributions and outliers, making it well-suited to sparse single-cell data. An advantage of this method is its simplicity and interpretability. However, it may have limited power in detecting subtle gene expression differences and can be conservative when many cells have zero counts for a given gene.

To assign cell type identities to each cluster, we reviewed the top marker genes and compared them to established gene signatures reported in hepatoblastoma studies. Clusters with high expression of genes such as GPC3, DLK1, and HMGA2 were labeled as tumor cells, as these markers are commonly associated with proliferative, fetal-like hepatoblasts in hepatoblastoma. Clusters enriched for CYP3A4, ALB, and HPGD were annotated as hepatocytes, given their role in mature liver function. We identified endothelial cells based on expression of vascular markers like FLT1 and PECAM1, while clusters expressing COL3A1 and COL6A3 were labeled as hepatic stellate cells due to their association with extracellular matrix production. Immune-related clusters were identified by the presence of markers such as PTPRC, CD3D, CD163, and CD68, allowing us to distinguish between T/NK cells, B cells, and macrophage-like Kupffer cells. A subset of cells expressing KRT19 and EPCAM was annotated as cholangiocytes. Some clusters remained ambiguous due to mixed or intermediate marker expression, and were labeled accordingly. These assignments are supported by findings from Zhang et al. (2022), who performed similar cell type classification in hepatoblastoma using single-cell RNA sequencing

First Replicate: Cell Proportion Analysis

Discussion

In our the single-cell RNA-seq dataset, it was observed that there is a prominent representation of tumor cells, suggesting that transformed hepatoblast-like populations dominate the cellular composition of these hepatoblastoma samples. This aligns with previous studies indicating that hepatoblastoma tumors are often composed of immature hepatic progenitor-like cells that express markers such as GPC3 and DLK1, reflective of their fetal liver origin and proliferative capacity [Zhang et al., 2022, DOI: 10.1038/s42003-021-02562-8]. Hepatocytes and endothelial cells were also well represented, which may reflect both residual normal tissue and tumor-associated microenvironments. The presence of Kupffer cells, NK/T cells, and B cells underscores the role of the immune microenvironment in hepatoblastoma, with recent studies suggesting that tumor-associated macrophages and lymphocyte infiltration may influence prognosis and treatment response [Chen et al., 2020, DOI: 10.1158/1078-0432.CCR-20-1255]. The detection of cholangiocytes and stellate cells further highlights the cellular heterogeneity of the liver and suggests that non-parenchymal cells may contribute to tumor development or fibrosis within the tumor niche. These proportions provide a high-resolution view of the hepatoblastoma microenvironment and set the stage for further subtype- and cell-type-specific analyses.

Second Replicate: Cell Signaling

Discussion

To explore how different cell populations interact within the hepatoblastoma tumor microenvironment, we generated a simplified network highlighting predicted signaling pathways among major cell types. The network revealed notable communication from tumor cells to both immune and endothelial populations, suggesting possible mechanisms of tumor-induced angiogenesis and immune modulation. Additionally, stromal cells, such as stellate cells, appeared to interact with both tumor and vascular cell types, aligning with their known roles in structural remodeling and tissue support. These observations echo patterns reported by Liu et al. (2022), who identified enriched pathways like CXCL and WNT as central to tumor progression in hepatoblastoma (Communications Biology, DOI: 10.1038/s42003-021-02562-8). Overall, this analysis provides an overview of key signaling relationships that may influence tumor behavior and the surrounding cellular landscape.

Personal Analysis: Cell Cycle Phase Analysis

Discussion:

There is analysis of the cell cycle phase distributions across clusters to investigate proliferative activity in the tumor microenvironment. Clusters enriched for S or G2/M phase suggest populations undergoing active division, which may indicate proliferating tumor or progenitor cells. This information complements existing annotations and may help identify subpopulations with stem-like or aggressive characteristics. Similar approaches have been used to reveal proliferative states in tumor scRNA-seq datasets (Macosko et al., 2015). Future work could explore how these proliferative clusters respond to treatment or correlate with prognosis.